July 25, 2015

Data visualization - why it matters

  • Quicker exploration of data
    • Charts and plots are often easier to digest than a bunch of numbers
  • Better exploration of data
    • Humans are not so good at detecting patterns from numbers
  • Powerful messaging
    • People trust you more if you have fancy graphics to point to

Reproducibility - why it matters

  • How did you get this number? It looks way off.
    • We Ctrl-C'd from another workbook with 20 linked formula dependencies, then Alt-E,S,V'd it here, but someone modified the source file, so now we have no idea what's going on.
  • The client just gave us an updated dataset; we need to update the report for tomorrow's meeting.
    • Looks like we're not sleeping tonight: we have to re-run the SAS scripts, copy the results into Excel, format them, make some charts, then copy those into Word and PowerPoint for the presentation :(

This talk

  • All examples are done in R

  • All code, including the code used to generate this deck, is available on GitHub

Case Study 1: Tornado data exploration

Tornado data exploration

  • Storm events data (2010-2013) from NOAA (National Oceanic and Atmospheric Administration)
list.files("download/")
## [1] "stormdata_2010.csv" "stormdata_2011.csv" "stormdata_2012.csv"
## [4] "stormdata_2013.csv"
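The per-year files can then be stacked into one data frame. A minimal base-R sketch of the pattern; the CSVs written below are tiny stand-ins for the real stormdata_*.csv downloads, so the example is self-contained:

```r
# Sketch: combine per-year CSV files into one data frame (base R).
# These two files are illustrative stand-ins for the real downloads.
dir <- tempdir()
write.csv(data.frame(YEAR = 2010, EVENT_TYPE = "Tornado"),
          file.path(dir, "stormdata_2010.csv"), row.names = FALSE)
write.csv(data.frame(YEAR = 2011, EVENT_TYPE = "Hail"),
          file.path(dir, "stormdata_2011.csv"), row.names = FALSE)

files <- list.files(dir, pattern = "^stormdata_.*\\.csv$", full.names = TRUE)
stormAll <- do.call(rbind, lapply(files, read.csv, stringsAsFactors = FALSE))
nrow(stormAll)  # one row per event across all years
```

The same stacking works on the four NOAA files, since they share the column layout shown above.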

Tornado data exploration

library(readr)
library(magrittr)
"download/stormdata_2010.csv" %>%
  read_csv %>%
  names
##  [1] "BEGIN_YEARMONTH"    "BEGIN_DAY"          "BEGIN_TIME"        
##  [4] "END_YEARMONTH"      "END_DAY"            "END_TIME"          
##  [7] "EPISODE_ID"         "EVENT_ID"           "STATE"             
## [10] "STATE_FIPS"         "YEAR"               "MONTH_NAME"        
## [13] "EVENT_TYPE"         "CZ_TYPE"            "CZ_FIPS"           
## [16] "CZ_NAME"            "WFO"                "BEGIN_DATE_TIME"   
## [19] "CZ_TIMEZONE"        "END_DATE_TIME"      "INJURIES_DIRECT"   
## [22] "INJURIES_INDIRECT"  "DEATHS_DIRECT"      "DEATHS_INDIRECT"   
## [25] "DAMAGE_PROPERTY"    "DAMAGE_CROPS"       "SOURCE"            
## [28] "MAGNITUDE"          "MAGNITUDE_TYPE"     "FLOOD_CAUSE"       
## [31] "CATEGORY"           "TOR_F_SCALE"        "TOR_LENGTH"        
## [34] "TOR_WIDTH"          "TOR_OTHER_WFO"      "TOR_OTHER_CZ_STATE"
## [37] "TOR_OTHER_CZ_FIPS"  "TOR_OTHER_CZ_NAME"  "BEGIN_RANGE"       
## [40] "BEGIN_AZIMUTH"      "BEGIN_LOCATION"     "END_RANGE"         
## [43] "END_AZIMUTH"        "END_LOCATION"       "BEGIN_LAT"         
## [46] "BEGIN_LON"          "END_LAT"            "END_LON"           
## [49] "EPISODE_NARRATIVE"  "EVENT_NARRATIVE"    "LAST_MOD_DATE"     
## [52] "LAST_MOD_TIME"      "LAST_CERT_DATE"     "LAST_CERT_TIME"    
## [55] "LAST_MOD"           "LAST_CERT"          "ADDCORR_FLG"       
## [58] "ADDCORR_DATE"

Tornado data exploration

source("R/tornado.R") # some data munging code
library(DT)
stormData %>%
  filter(type == "Tornado") %>%
  datatable(filter="top", rownames=FALSE, options=list(pageLength=5, dom='tp'))

Tornado data exploration

  • Here's a quick plot of daily tornado counts:
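A plot along these lines can be produced with a few lines of base graphics. The data here are simulated stand-ins (the real stormData is built in R/tornado.R):

```r
# Sketch: daily tornado counts as a time-series plot (base graphics).
# Simulated counts stand in for the real stormData.
set.seed(1)
dates  <- seq(as.Date("2010-01-01"), as.Date("2013-12-31"), by = "day")
counts <- rpois(length(dates), lambda = 3)   # fake daily tornado counts
plot(dates, counts, type = "h",
     xlab = "Date", ylab = "Tornado count",
     main = "Daily tornado counts (simulated)")
```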

Tornado data exploration

  • Let's focus on the outbreak period and look at stats by state:
stormData %>%
  filter(type == "Tornado",
         date >= ymd("2011-4-25"), date <= ymd("2011-4-28")) %>%
  group_by(state) %>%
  summarize(count = n(),
            deaths = sum(deaths)) %>%
  datatable(rownames = FALSE, options = list(pageLength=5, dom='tp'))

Tornado data exploration

  • Here is a map of the tornadoes during the four years:

Tornado data exploration

  • Excerpt of the code for the map on the previous page:
library(leaflet)
counties %>%
  leaflet() %>%
  addTiles() %>%
  addPolygons(stroke = FALSE, smoothFactor = 0.2, 
              fillOpacity = 0.8, fillColor = ~ pal(count),
              popup = countyPopup) %>%
  addCircleMarkers(data = outbreakOccurences, 
                   lng = ~ long, lat = ~ lat, 
                   radius = ~ sqrt(10*deaths),
                   fillOpacity = 0.2, color = "blue", 
                   stroke = FALSE, popup = deathPopup)
  • Since it's just text, put it under source control!

Case study 2: Insurer-Reinsurer relationships

Insurer-Reinsurer relationships

  • Make up some (re)insurance companies and treaties:
source("R/network-graph.R")
sample_n(companies, 4)
##              company     group     size
## 21      Transamerica   primary 31.03049
## 26           Lloyd's reinsurer 38.68552
## 10              USAA   primary 22.68791
## 20 Lincoln Financial   primary 39.80798
sample_n(treaties, 4)
## Source: local data frame [4 x 3]
## 
##   cedant reinsurer premiumCeded
## 1     20        23     9.002772
## 2     11        28     8.918590
## 3      3        28     8.339355
## 4     13        27     7.323169

Insurer-Reinsurer relationships

forceNetwork(Links = treaties, Nodes = companies, 
             Source = "cedant", Target = "reinsurer",
             Value = "premiumCeded", NodeID = "company", Nodesize = "size",
             Group = "group", opacity = 0.8,
             colourScale = "d3.scale.category10()")

Case study 3: Predictive modeling

Predictive modeling

  • For this case study, we'll use the Insurance dataset from the MASS package.
##    District  Group   Age Holders Claims logHolders fold
## 1         1    <1l   <25     197     38   5.283204    1
## 2         1    <1l 25-29     264     35   5.575949    2
## 3         1    <1l 30-35     246     20   5.505332    1
## 4         1    <1l   >35    1680    156   7.426549    2
## 5         1 1-1.5l   <25     284     63   5.648974    1
## 6         1 1-1.5l 25-29     536     84   6.284134    2
## 7         1 1-1.5l 30-35     696     89   6.545350    2
## 8         1 1-1.5l   >35    3582    400   8.183677    2
## 9         1 1.5-2l   <25     133     19   4.890349    3
## 10        1 1.5-2l 25-29     286     52   5.655992    3
  • We'll fit Poisson regression models to predict claim counts and validate the results graphically
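A minimal sketch of such a model on the MASS data, using log(Holders) as the exposure offset. The talk's own model spec lives in its scripts, so treat this as one plausible formulation rather than the model used for the slides:

```r
# Poisson regression on MASS::Insurance: claim counts with an
# exposure offset of log(Holders).
library(MASS)
fit <- glm(Claims ~ District + Group + Age + offset(log(Holders)),
           family = poisson, data = Insurance)
summary(fit)$coefficients[, "Estimate"]
```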

Predictive modeling

  • Here's a 3-fold relativity table. If the model is stable, we would expect the estimates to be similar across the different samples.
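One way such a table can be built: refit the model on each resample and compare the exponentiated coefficients ("relativities"). The random fold assignment below is an illustrative stand-in for the talk's folds:

```r
# Sketch: relativities (exp of coefficients) refit across 3 resamples.
# The random fold assignment here is illustrative, not the talk's folds.
library(MASS)
set.seed(42)
Insurance$fold <- sample(rep_len(1:3, nrow(Insurance)))
relativities <- sapply(1:3, function(k) {
  fit <- glm(Claims ~ District + Group + Age + offset(log(Holders)),
             family = poisson,
             data = Insurance[Insurance$fold != k, ])  # drop one fold
  exp(coef(fit))
})
colnames(relativities) <- paste0("excl_fold_", 1:3)
round(relativities, 3)  # a stable model => similar columns
```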

Predictive modeling

  • Here's an example where a picture is more appropriate than numbers

Predictive modeling

  • Here is an out-of-sample predicted vs. actual plot, which shows how the model performs in each quantile range of the predicted values:
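The numbers behind such a plot can be computed by binning out-of-sample predictions into quantile ranges and averaging within each bin. The train/test split below is illustrative; the talk's plot uses its own folds:

```r
# Sketch: out-of-sample predicted vs. actual by quantile bin of the
# prediction. The train/test split here is illustrative.
library(MASS)
set.seed(7)
idx   <- sample(nrow(Insurance), size = 44)   # ~2/3 train
train <- Insurance[idx, ]
test  <- Insurance[-idx, ]
fit   <- glm(Claims ~ District + Group + Age + offset(log(Holders)),
             family = poisson, data = train)
pred  <- predict(fit, newdata = test, type = "response")
bin   <- cut(pred, quantile(pred, probs = seq(0, 1, 0.25)),
             include.lowest = TRUE)            # quartile bins
cbind(predicted = tapply(pred, bin, mean),
      actual    = tapply(test$Claims, bin, mean))
```

Plotting the two columns against each other gives the predicted-vs-actual picture; points near the diagonal indicate good calibration in that range.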

Predictive modeling

BTW, in a real life predictive modeling exercise, consider

  • having a hold-out validation set and looking at quantitative error metrics

  • using other techniques in addition to or instead of GLMs, unless you're constrained by regulation

  • not using stepwise regression, and using regularization if you're stuck doing GLMs

  • not relying on p-values and frequentist confidence intervals

That's it!